NON LINEAR REGRESSION MODELS¶
Objective¶
This homework sheet will help you review the basic concepts associated with non-linear models. Please review the lectures, suggested readings, and additional resources before getting started on the HW.
Some questions in this assignment will require you to conduct independent research beyond the material covered in the recorded content.
Some of the libraries / links mentioned in APPLIED questions are only for your reference. Unless specifically indicated, feel free to utilise any Python library of your choice to carry out the assignment questions.
PYGAM REFERENCE LINK:
https://pygam.readthedocs.io/en/latest/notebooks/tour_of_pygam.html
STATSMODELS REFERENCE LINK:
https://www.statsmodels.org/dev/glm.html
PATSY REFERENCE LINK:
The following website has access to the relevant datasets from the recommended textbook: https://book.huihoo.com/introduction-to-statistical-learning/data.html
Marks Distribution
| Question | Marks |
|---|---|
| 1a | 0.5 |
| 1b | 0.5 |
| 1c | 0.5 |
| 1d | 0.5 |
| 1e | 0.5 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5a | 0.5 |
| 5b | 0.5 |
| 5c | 0.5 |
| 6a | 2 |
| 6b | 1 |
| 7a | 1 |
| 7b | 2 |
| 7c | 1 |
| 7d | 1 |
Conceptual¶
Question 1¶
Suppose that a curve $\hat{g}$ is computed to smoothly fit a set of $n$ points using the following formula:
$$ \hat{g} = \arg\min_{g}\left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(m)} (x)]^2 dx \right) $$ where $g^{(m)}$ represents the $m^{th}$ derivative of $g$ (and $g^{(0)} = g$). Provide a description (or sketch) of $\hat{g}$ in each of the following scenarios.
(a) $\lambda$ = $\infty$, $m = 0$.
ANSWER With $\lambda = \infty$, the penalty term dominates the optimization and fitting the points (the RSS term) becomes negligible. For $m = 0$ the penalty is $\lambda \int [g(x)]^2 dx$, so $\hat{g}$ must make the function itself as small as possible everywhere. Hence $\hat{g}(x) = 0$.
(b) $\lambda$ = $\infty$, $m = 1$.
ANSWER With $\lambda = \infty$, the penalty term dominates and fitting the points (RSS) becomes negligible. For $m = 1$ the penalty $\lambda \int [g^{(1)}(x)]^2 dx$ forces the first derivative to be zero everywhere, i.e. $g$ must be perfectly flat. Among all constant functions, the RSS is minimized by the mean of the observations, so $\hat{g}(x) = \bar{y}$, a horizontal line at the mean of the $y_i$.
(c) $\lambda$ = $\infty$, $m = 2$.
ANSWER With $\lambda = \infty$, the penalty term dominates and fitting the points (RSS) becomes negligible. For $m = 2$ the penalty $\lambda \int [g^{(2)}(x)]^2 dx$ forces the second derivative to be zero, and $g^{(2)}(x) = 0$ holds exactly for linear functions. So $\hat{g}$ is the least-squares straight line through the data: $\hat{g}(x) = \beta_0 + \beta_1 x$ with $\beta_0$ and $\beta_1$ chosen to minimize the RSS.
(d) $\lambda$ = $\infty$, $m = 3$.
ANSWER With $\lambda = \infty$, the penalty term dominates and fitting the points (RSS) becomes negligible. For $m = 3$ the penalty $\lambda \int [g^{(3)}(x)]^2 dx$ forces the third derivative to be zero, and $g^{(3)}(x) = 0$ holds exactly for quadratics. So $\hat{g}$ is the least-squares quadratic: $\hat{g}(x) = \beta_0 + \beta_1 x + \beta_2 x^{2}$ with the $\beta_i$ chosen to minimize the RSS.
(e) $\lambda$ = 0, $m = 3$.
ANSWER When $\lambda = 0$, the regularization term has no effect, and the criterion reduces to minimizing the sum of squared errors (RSS) without any penalty on the curvature. $\hat{g}$ is then a curve that passes through (interpolates) every training point exactly, which will typically overfit the data.
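To make these limiting cases concrete, here is a small sketch (not part of the assignment solution, on synthetic data) using SciPy's `make_smoothing_spline`, which penalizes the second derivative, i.e. the $m = 2$ case in (c): a tiny $\lambda$ nearly interpolates the points as in (e), while a huge $\lambda$ collapses the fit to the least-squares straight line.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# lam ~ 0: the penalty is negligible, the spline chases every point (case e)
g_rough = make_smoothing_spline(x, y, lam=1e-6)
# lam -> infinity: the second-derivative penalty dominates, and the fit
# collapses towards the least-squares straight line (case c)
g_smooth = make_smoothing_spline(x, y, lam=1e6)

rss_rough = np.sum((y - g_rough(x)) ** 2)
rss_smooth = np.sum((y - g_smooth(x)) ** 2)

# the heavily penalized fit is essentially the OLS line through the data
line = np.polyval(np.polyfit(x, y, 1), x)
```

With the small $\lambda$ the training RSS is far lower, exactly the trade-off discussed above.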
Question 2¶
For (1) step functions, (2) local regression, and (3) GAMs: provide one situation when using that particular model is advantageous. For instance, “GAMs is likely the preferred option when the question I’m interested in addressing involves…”.
ANSWER
Step Functions: Step functions are useful when a predictor's effect changes abruptly at certain thresholds. The range of the predictor is cut into bins and a separate constant is fit within each bin, which lets the model capture sharp, discontinuous changes that a global polynomial would smooth over. For example, in marketing you might use a step function to model the effect of price on demand, where demand remains roughly constant until a price threshold is reached, after which it drops suddenly.
Local Regression: Local regression, such as LOESS (locally estimated scatterplot smoothing), is advantageous when the relationship in your data is non-linear and you want a more flexible approach than global polynomial regression. It is a non-parametric technique that fits a smooth curve by assigning higher weights to nearby data points and lower weights to more distant points, without assuming any particular functional form. Note that it is a memory-based procedure: all of the training data are needed each time a prediction is computed.
Generalized Additive Models (GAMs): GAMs are advantageous when the relationship between the predictors and the response is non-linear and you want a flexible model that can capture those non-linearities without specifying the exact form of the relationship. Each predictor's effect is modeled with its own smooth function, so with multiple predictors you can still examine each one's individual effect on the response. The main limitation of GAMs is that the model is restricted to be additive; however, interaction terms can be added manually by including additional predictors of the form $X_j \times X_k$.
Question 3¶
Define the following types of smooth functions from PYGAM library (https://pygam.readthedocs.io/en/latest/notebooks/tour_of_pygam.html ) and list at least one relevant potential use:
a) l() linear terms
b) s() spline terms
c) f() factor terms
d) te() Tensor product interaction terms
ANSWER In the context of the PYGAM library for generalized additive models (GAMs), each function type serves a specific purpose in modeling nonlinear relationships between predictors and the target variable.
a) l(feature, lam=0.6, penalties='auto', verbose=False) linear terms: These are simple linear functions of a single predictor. They model a linear relationship between the predictor and the target variable.
Example: Suppose you're modeling the relationship between temperature and energy consumption. You might use a linear term to model how energy consumption changes with temperature in a linear manner.
b) s(feature, n_splines=20, spline_order=3, lam=0.6, penalties='auto', constraints=None, dtype='numerical', basis='ps', by=None, edge_knots=None, verbose=False) spline terms: Splines are flexible functions that allow for nonlinear relationships with the predictor. They are especially useful when the relationship is not linear but still smooth.
Example: In a study of the effect of age on income, you might use a spline term to capture the nonlinear relationship, as income might increase rapidly in early career stages, level off, and potentially decline post-retirement.
c) f(feature, lam=0.6, penalties='auto', coding='one-hot', verbose=False) factor terms: Factor terms are used for categorical predictors. They allow each category to have its own effect on the target variable, which is particularly useful when the effect of the category is not linear or when there are interactions between categories.
Example: If you're studying the impact of different types of marketing strategies on sales, you might use factor terms for the different marketing channels to capture how each channel affects sales differently.
d) te() Tensor product interaction terms: Tensor product interactions allow for interactions between two or more predictors. They are useful when the effect of one predictor on the target variable depends on the value of another predictor.
Example: In a study of the effect of both temperature and humidity on plant growth, you might use a tensor product interaction term to capture how the effect of temperature on growth varies with different levels of humidity.
Each of these function types provides a way to model complex relationships in GAMs, allowing for more flexibility and accuracy in capturing the underlying patterns in the data.
Question 4¶
Briefly explain what spatial and temporal autocorrelation are. List one way in which GAMs could help to account for this (e.g. indicate the type of smooth function you would use or potential types of correlation structures).
ANSWER Spatial autocorrelation refers to the degree to which the values of a variable at nearby locations in space are similar to each other. Temporal autocorrelation, on the other hand, refers to the degree to which the values of a variable at nearby points in time are similar to each other.
Generalized additive models (GAMs) can help account for spatial or temporal autocorrelation by incorporating smooth functions of space or time: for spatial autocorrelation, a two-dimensional smooth of latitude and longitude (e.g. a tensor-product term); for temporal autocorrelation, a smooth function of time.
Alternatively, one could use structured correlation functions, such as the exponential, spherical, or Matérn covariance functions, to model the spatial or temporal correlation explicitly. These functions define the correlation between two points as a function of their distance or time difference, allowing for more flexible modeling of autocorrelation structures.
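As a tiny numeric illustration of temporal autocorrelation (synthetic AR(1) data; the helper name is my own), compare the lag-1 autocorrelation of a serially dependent series against i.i.d. noise:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
noise = rng.normal(size=n)

# AR(1) process: each value depends on the previous one --
# exactly what temporal autocorrelation means
x = np.empty(n)
x[0] = noise[0]
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + noise[t]

def lag1_autocorr(v):
    """Sample lag-1 autocorrelation of a 1-D series."""
    v = v - v.mean()
    return float(np.dot(v[:-1], v[1:]) / np.dot(v, v))

rho = lag1_autocorr(x)           # close to the true coefficient 0.8
rho_iid = lag1_autocorr(noise)   # i.i.d. noise: close to 0
```

Ignoring a `rho` this large in a regression would make standard errors badly optimistic, which is why the smooth-in-time or covariance-structure fixes above matter.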
Question 5¶
Consider two curves $\hat{g}_1$ and $\hat{g}_2$, defined by
$$ \hat{g}_1 = \arg\min_{g}\left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(3)} (x)]^2 dx \right) $$
$$ \hat{g}_2 = \arg\min_{g}\left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(4)} (x)]^2 dx \right) $$
where $g^{(m)}$ represents the $m^{th}$ derivative of $g$. Now, answer the following questions (and include a brief explanation):
a) As $\lambda \rightarrow \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have smaller training RSS?
ANSWER As $\lambda \rightarrow \infty$, the penalty dominates and forces the penalized derivative to zero: $\hat{g}_1$ collapses to a polynomial of degree at most 2 (since $g^{(3)} = 0$), while $\hat{g}_2$ collapses to a polynomial of degree at most 3 (since $g^{(4)} = 0$). Because $\hat{g}_2$ retains an extra degree of flexibility, it can fit the training data at least as well, so $\hat{g}_2$ will have the smaller training RSS.
b) As $\lambda \rightarrow \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have smaller test RSS?
ANSWER On test data the answer depends on the true relationship. $\hat{g}_2$ is the more flexible of the two and thus more prone to overfitting, so $\hat{g}_1$ is likely to have the smaller test RSS, provided the underlying pattern is not so complex that the extra smoothness of $\hat{g}_1$ introduces too much bias.
c) For $\lambda = 0$, will $\hat{g}_1$ or $\hat{g}_2$ have smaller training RSS?
ANSWER For $\lambda = 0$ there is no penalty term at all, so the two criteria are identical: both reduce to minimizing the training RSS with no regularization. $\hat{g}_1$ and $\hat{g}_2$ will therefore be the same function and have the same training RSS.
APPLIED¶
Question 6¶
This question relates to the College data set.
Pre-process the data if required.
a) Fit a GAM on the training data, using out-of-state tuition as the response and only subset of features as the predictors ( Select 'private', 'Room.Board' , 'PhD' , 'perc.alumni' , 'Expend' , 'Grad.rate' ) Plot the results and briefly explain your findings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
college = pd.read_csv('College.csv')
preprocessed_college = college.copy()
preprocessed_college['private'] = college['Private'].map({'Yes': 1, 'No': 0})
preprocessed_college
| Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | private | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 | 1 |
| 1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 | 1 |
| 2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 | 1 |
| 3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 | 1 |
| 4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 772 | Worcester State College | No | 2197 | 1515 | 543 | 4 | 26 | 3089 | 2029 | 6797 | 3900 | 500 | 1200 | 60 | 60 | 21.0 | 14 | 4469 | 40 | 0 |
| 773 | Xavier University | Yes | 1959 | 1805 | 695 | 24 | 47 | 2849 | 1107 | 11520 | 4960 | 600 | 1250 | 73 | 75 | 13.3 | 31 | 9189 | 83 | 1 |
| 774 | Xavier University of Louisiana | Yes | 2097 | 1915 | 695 | 34 | 61 | 2793 | 166 | 6900 | 4200 | 617 | 781 | 67 | 75 | 14.4 | 20 | 8323 | 49 | 1 |
| 775 | Yale University | Yes | 10705 | 2453 | 1317 | 95 | 99 | 5217 | 83 | 19840 | 6510 | 630 | 2115 | 96 | 96 | 5.8 | 49 | 40386 | 99 | 1 |
| 776 | York College of Pennsylvania | Yes | 2989 | 1855 | 691 | 28 | 63 | 2988 | 1726 | 4990 | 3560 | 500 | 1250 | 75 | 75 | 18.1 | 28 | 4509 | 99 | 1 |
777 rows × 20 columns
preprocessed_college = preprocessed_college[['private', 'Room.Board' , 'PhD' , 'perc.alumni' , 'Expend' , 'Grad.Rate', 'Outstate']]
# checking for any anomaly in data
plt.figure(figsize=(16, 8))
sns.boxplot(data=preprocessed_college)
plt.show()
preprocessed_college.describe()
| private | Room.Board | PhD | perc.alumni | Expend | Grad.Rate | Outstate | |
|---|---|---|---|---|---|---|---|
| count | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.00000 | 777.000000 |
| mean | 0.727156 | 4357.526384 | 72.660232 | 22.743887 | 9660.171171 | 65.46332 | 10440.669241 |
| std | 0.445708 | 1096.696416 | 16.328155 | 12.391801 | 5221.768440 | 17.17771 | 4023.016484 |
| min | 0.000000 | 1780.000000 | 8.000000 | 0.000000 | 3186.000000 | 10.00000 | 2340.000000 |
| 25% | 0.000000 | 3597.000000 | 62.000000 | 13.000000 | 6751.000000 | 53.00000 | 7320.000000 |
| 50% | 1.000000 | 4200.000000 | 75.000000 | 21.000000 | 8377.000000 | 65.00000 | 9990.000000 |
| 75% | 1.000000 | 5050.000000 | 85.000000 | 31.000000 | 10830.000000 | 78.00000 | 12925.000000 |
| max | 1.000000 | 8124.000000 | 103.000000 | 64.000000 | 56233.000000 | 118.00000 | 21700.000000 |
sns.pairplot(data=preprocessed_college)
/Users/lngeorge/anaconda3/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
<seaborn.axisgrid.PairGrid at 0x10777ad10>
!pip install pygam
Requirement already satisfied: pygam in /Users/lngeorge/anaconda3/lib/python3.10/site-packages (0.9.1) Requirement already satisfied: numpy>=1.25 in /Users/lngeorge/anaconda3/lib/python3.10/site-packages (from pygam) (1.25.2) Requirement already satisfied: progressbar2<5.0.0,>=4.2.0 in /Users/lngeorge/anaconda3/lib/python3.10/site-packages (from pygam) (4.4.2) Requirement already satisfied: scipy<1.12,>=1.11.1 in /Users/lngeorge/anaconda3/lib/python3.10/site-packages (from pygam) (1.11.3) Requirement already satisfied: python-utils>=3.8.1 in /Users/lngeorge/anaconda3/lib/python3.10/site-packages (from progressbar2<5.0.0,>=4.2.0->pygam) (3.8.2) Requirement already satisfied: typing-extensions>3.10.0.2 in /Users/lngeorge/anaconda3/lib/python3.10/site-packages (from python-utils>=3.8.1->progressbar2<5.0.0,>=4.2.0->pygam) (4.7.1)
from pygam import LinearGAM
features = ['private', 'Room.Board', 'PhD', 'perc.alumni', 'Expend', 'Grad.Rate']
X = preprocessed_college[features]
Y = preprocessed_college["Outstate"]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
lams = np.logspace(-3, 3, 20) * X_train.shape[1]
splines = list(range(5, 55, 5))
best_mse = np.inf
best_lambda = None
best_spline = 5
for lam in lams:
for spline in splines:
gam = LinearGAM(n_splines=spline, lam=lam).fit(X_train.values, y_train.values)
y_pred = gam.predict(X_test.values)
mse = mean_squared_error(y_test, y_pred)
if mse < best_mse:
best_mse = mse
best_lambda = lam
best_spline = spline
print(f"Best Lambda: {best_lambda}")
print(f"Best no of Splines: {best_spline}")
print(f"Best MSE: {best_mse}")
final_gam = LinearGAM(n_splines=best_spline, lam=best_lambda).fit(X_train.values, y_train.values)
# Partial Dependence Plots
for i, term in enumerate(final_gam.terms):
if term.isintercept:
continue
XX = final_gam.generate_X_grid(term=i)
pdep, confi = final_gam.partial_dependence(term=i, X=XX, width=0.95)
# Get the feature name from the DataFrame
feature_name = X.columns[i]
# Plot partial dependence and confidence intervals
plt.figure()
plt.rcParams['figure.figsize'] = (10, 6)
plt.plot(XX[:, i], pdep)
plt.fill_between(XX[:, i], confi[:, 0], confi[:, 1], color='r', alpha=0.3)
plt.title(f'Partial Dependence Plot for {feature_name}')
plt.xlabel(feature_name)
plt.ylabel('Predicted out-of-state Tuition')
plt.show()
Best Lambda: 2.0158909717702684 Best no of Splines: 15 Best MSE: 3374555.644641011
We can see how the relationship changes for each feature:
- For private, private colleges have higher out-of-state tuition in general.
- For Room.Board, the relationship is not strictly linear, but tuition generally increases with room and board.
- Out-of-state tuition is almost constant with PhD, with only small differences.
- perc.alumni also seems to have a small positive correlation with tuition.
- Tuition increases with Expend at first, but then appears to stabilise as expenditure grows.
- Grad.Rate seems to vary in a wave-like manner, increasing over the first half of its range and then decreasing.
b) For which variables, if any, is there evidence of a non-linear relationship with the response?
From the above plots, Grad.Rate, Expend, perc.alumni, PhD, and Room.Board all show clear patterns that are not linear.
Question 7¶
Answer the questions below using the following dataset (WEEK3_Q7 CSV provided with the learner template):
df2 = pd.read_csv('Week3_Q7.csv')
df2.head(10)
| Year | Route | statenum | richness | abundance | Active | Latitude | Longitude | RouteTypeID | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1966 | 2 | 25 | 41 | 491 | 1 | 30.375543 | -86.275832 | 1 |
| 1 | 1966 | 3 | 25 | 40 | 292 | 1 | 30.519316 | -86.480913 | 1 |
| 2 | 1966 | 4 | 25 | 51 | 491 | 1 | 30.528840 | -85.133454 | 1 |
| 3 | 1966 | 6 | 25 | 52 | 851 | 1 | 30.595019 | -84.041309 | 1 |
| 4 | 1966 | 10 | 25 | 56 | 458 | 1 | 30.273463 | -83.856968 | 1 |
| 5 | 1966 | 11 | 25 | 43 | 360 | 1 | 29.668699 | -83.378206 | 1 |
| 6 | 1966 | 12 | 25 | 45 | 344 | 1 | 30.502624 | -82.733741 | 1 |
| 7 | 1966 | 13 | 25 | 57 | 599 | 0 | 29.546846 | -82.263030 | 1 |
| 8 | 1966 | 14 | 25 | 52 | 485 | 1 | 29.191620 | -82.394802 | 1 |
| 9 | 1966 | 16 | 25 | 46 | 1452 | 0 | 28.036316 | -82.505005 | 1 |
df2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2824 entries, 0 to 2823 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 2824 non-null int64 1 Route 2824 non-null int64 2 statenum 2824 non-null int64 3 richness 2824 non-null int64 4 abundance 2824 non-null int64 5 Active 2824 non-null int64 6 Latitude 2824 non-null float64 7 Longitude 2824 non-null float64 8 RouteTypeID 2824 non-null int64 dtypes: float64(2), int64(7) memory usage: 198.7 KB
df2.nunique()
Year 50 Route 123 statenum 1 richness 59 abundance 1000 Active 2 Latitude 119 Longitude 122 RouteTypeID 1 dtype: int64
df2.describe(include='all')
| Year | Route | statenum | richness | abundance | Active | Latitude | Longitude | RouteTypeID | |
|---|---|---|---|---|---|---|---|---|---|
| count | 2824.000000 | 2824.000000 | 2824.0 | 2824.000000 | 2824.000000 | 2824.000000 | 2824.000000 | 2824.000000 | 2824.0 |
| mean | 1995.283286 | 171.122167 | 25.0 | 47.113314 | 689.820113 | 0.830028 | 28.724434 | -82.661525 | 1.0 |
| std | 13.169235 | 303.107324 | 0.0 | 9.066053 | 354.727766 | 0.375674 | 1.659540 | 1.839859 | 0.0 |
| min | 1966.000000 | 1.000000 | 25.0 | 19.000000 | 147.000000 | 0.000000 | 24.597703 | -87.407943 | 1.0 |
| 25% | 1987.000000 | 19.000000 | 25.0 | 41.000000 | 453.750000 | 1.000000 | 27.426641 | -83.856968 | 1.0 |
| 50% | 1997.000000 | 52.000000 | 25.0 | 47.000000 | 615.000000 | 1.000000 | 28.785686 | -82.118666 | 1.0 |
| 75% | 2006.000000 | 87.000000 | 25.0 | 53.000000 | 809.000000 | 1.000000 | 30.292951 | -81.418950 | 1.0 |
| max | 2015.000000 | 925.000000 | 25.0 | 79.000000 | 4168.000000 | 1.000000 | 30.965694 | -80.203285 | 1.0 |
import matplotlib.pyplot as plt
correlation_matrix = df2.drop(["statenum", "RouteTypeID"], axis=1).corr()
plt.figure(figsize=(16, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Matrix')
plt.show()
sns.pairplot(data=df2)
/Users/lngeorge/anaconda3/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
<seaborn.axisgrid.PairGrid at 0x30a7aaef0>
a) Only by inspecting the plots shown below: Is there any temporal or spatial trend in the response variable richness?
sns.scatterplot(x=df2.Year,y=df2.richness)
<Axes: xlabel='Year', ylabel='richness'>
ANSWER
The plot does not show a strong relationship for the response variable richness; the correlation with Year is only about -0.15. There is, however, a weak association: the variance of richness within a year appears to increase over time, while its average value slowly decreases. On closer inspection there is also a wave-like pattern in richness as time moves on, which could indicate weak temporal autocorrelation. The spread of values within the same year suggests that a spatial trend in the response may also be present, and it may grow with time.
b) Now, fit a Poisson GAM with the following structure E(richness) = exp(f(year))*exp(g(location)). Note that location is actually two features: Longitude and Latitude. Name this model M1.
Plot the model, get the summary of M1, and answer: Is there evidence for effects of location and year?
X = df2[['Year', 'Longitude', 'Latitude']]
y = df2['richness']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from pygam import PoissonGAM, s, te
import pandas as pd
import matplotlib.pyplot as plt
lams = np.logspace(-3, 3, 20) * X_train.shape[1]
splines = list(range(20, 28, 1))
term = s(0) + te(1, 2)
best_mse = np.inf
best_spline = 1
M1 = None
# for lam in lams:
for spline in splines:
gam = PoissonGAM(term, n_splines=spline).gridsearch(X_train.values, y_train.values)
y_pred = gam.predict(X_test.values)
mse = mean_squared_error(y_test, y_pred)
if mse < best_mse:
M1 = gam
best_mse = mse
best_spline = spline
print(f"Best Lambda: {M1.lam}")
print(f"Best no of Splines: {best_spline}")
print(f"Best MSE: {best_mse}")
100% (11 of 11) |########################| Elapsed Time: 0:00:07 Time: 0:00:070:00 100% (11 of 11) |########################| Elapsed Time: 0:00:07 Time: 0:00:070:00 100% (11 of 11) |########################| Elapsed Time: 0:00:10 Time: 0:00:100:00 100% (11 of 11) |########################| Elapsed Time: 0:00:11 Time: 0:00:110:01 100% (11 of 11) |########################| Elapsed Time: 0:00:16 Time: 0:00:160:01 100% (11 of 11) |########################| Elapsed Time: 0:00:15 Time: 0:00:150:01 100% (11 of 11) |########################| Elapsed Time: 0:00:19 Time: 0:00:190:01 100% (11 of 11) |########################| Elapsed Time: 0:00:23 Time: 0:00:230:02
Best Lambda: [[0.015848931924611134], [[0.015848931924611134], [0.015848931924611134]]] Best no of Splines: 24 Best MSE: 31.934568944096114
XX = M1.generate_X_grid(term=0)
pdep, confi = M1.partial_dependence(term=0, X=XX, width=0.95)
feature_name = X.columns[0]
plt.figure()
title = 'Predicted Richness'
plt.rcParams['figure.figsize'] = (10, 6)
plt.plot(XX[:, 0], pdep)
plt.fill_between(XX[:, 0], confi[:, 0], confi[:, 1], color='r', alpha=0.3)
plt.title(f'Partial Dependence Plot for {feature_name}')
plt.xlabel(feature_name)
plt.ylabel(title)
plt.show()
def genrate_partial_dependence_graph(M, term_index, predictor_columns=[], response_column="Richness"):
feature_index = M.terms[term_index].feature
XX = M.generate_X_grid(term=term_index)
pdep, confi = M.partial_dependence(term=term_index, X=XX, width=0.95)
# combined_data = np.concatenate((XX, pdep, confi), axis=1)
if len(predictor_columns) == 0:
predictor_columns = X.columns
data = {predictor_columns[i]: XX[:, i] for i in feature_index}
data.update({response_column: pdep, 'confi_low': confi[:, 0], 'confi_high': confi[:, 1]})
combined_df = pd.DataFrame(data)
for index in feature_index:
feature_name = predictor_columns[index]
sorted_df = combined_df.sort_values(by=[feature_name])
plt.figure()
plt.rcParams['figure.figsize'] = (10, 6)
plt.plot(sorted_df[feature_name], sorted_df[response_column])
plt.fill_between(sorted_df[feature_name], sorted_df["confi_low"], sorted_df["confi_high"], color='r', alpha=0.3)
plt.title(f'Partial Dependence Plot for {feature_name}')
        plt.xlabel(feature_name)
        plt.ylabel(f'Predicted {response_column}')
plt.show()
genrate_partial_dependence_graph(M1, 1, X.columns, "Richness")
def generate_interaction_plot(M, i, predictor_columns=[], response_column="Richness"):
XX = M.generate_X_grid(term=i, meshgrid=True)
feature_index = M.terms[i].feature
pdep, confi = M.partial_dependence(term=i, X=XX, width=0.95, meshgrid=True)
if len(predictor_columns) == 0:
predictor_columns = X.columns
# print(pdep.shape, XX[0].shape)
# Create a 3D plot
fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')
# Plot the surface
surf = ax.plot_surface(XX[0].reshape(pdep.shape),
XX[1].reshape(pdep.shape),
pdep, cmap='viridis', edgecolor='none')
# Add confidence intervals
ax.plot_wireframe(XX[0].reshape(pdep.shape),
XX[1].reshape(pdep.shape),
confi[:, :, 0], color='r', alpha=0.3)
ax.plot_wireframe(XX[0].reshape(pdep.shape),
XX[1].reshape(pdep.shape),
confi[:, :, 1], color='r', alpha=0.3)
# Set labels and title
ax.set_xlabel(predictor_columns[feature_index[0]])
ax.set_ylabel(predictor_columns[feature_index[1]])
ax.set_zlabel( f'Predicted {response_column}')
ax.set_title(f'Partial Dependence Plot for {predictor_columns[feature_index[0]]} and {predictor_columns[feature_index[1]]}')
ax.set_box_aspect(None, zoom=0.9)
# Show the plot
plt.show()
generate_interaction_plot(M1, 1, X.columns)
M1.summary()
PoissonGAM
=============================================== ==========================================================
Distribution: PoissonDist Effective DoF: 126.7811
Link Function: LogLink Log Likelihood: -7142.2696
Number of Samples: 2259 AIC: 14538.1014
AICc: 14553.5512
UBRE: 2.808
Scale: 1.0
Pseudo R-Squared: 0.6373
==========================================================================================================
Feature Function Lambda Rank EDoF P > x Sig. Code
================================= ==================== ============ ============ ============ ============
s(0) [0.0158] 24 24.0 2.24e-10 ***
te(1, 2) [0.0158 0.0158] 576 102.8 0.00e+00 ***
intercept 1 0.0 1.59e-09 ***
==========================================================================================================
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
which can cause p-values to appear significant when they are not.
WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
known smoothing parameters, but when smoothing parameters have been estimated, the p-values
are typically lower than they should be, meaning that the tests reject the null too readily.
/var/folders/fq/t1vhfgh95v33mglpnsqgnyyh0000gq/T/ipykernel_9609/3019879061.py:1: UserWarning: KNOWN BUG: p-values computed in this summary are likely much smaller than they should be. Please do not make inferences based on these values! Collaborate on a solution, and stay up to date at: github.com/dswah/pyGAM/issues/163 M1.summary()
Yes. The partial dependence plots show a wave-like variation in richness over time, and the tensor term shows how location (latitude and longitude) affects richness. Note that the confidence intervals narrow in regions where we have more data. The p-values for the feature functions are also very small, so there is clear evidence of these effects, although pyGAM warns that p-values computed after estimating smoothing parameters are likely smaller than they should be and must be interpreted with caution.
c) Now, fit another Poisson GAM model with the same structure as in M1; name this model M2, but use interactions between Year, Latitude and Longitude
from pygam import PoissonGAM, s, te
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
splines = list(range(6, 12, 1))
term = te(0, 1, 2)
best_mse = np.inf
best_spline = 1
M2 = None
# for lam in lams:
for spline in splines:
gam = PoissonGAM(term, n_splines=spline).gridsearch(X_train.values, y_train.values)
y_pred = gam.predict(X_test.values)
mse = mean_squared_error(y_test, y_pred)
if mse < best_mse:
M2 = gam
best_mse = mse
best_spline = spline
print(f"Best Lambda: {M2.lam}")
print(f"Best no of Splines: {best_spline}")
print(f"Best MSE: {best_mse}")
Best Lambda: [[[0.001], [0.001], [0.001]]]
Best no of Splines: 9
Best MSE: 34.875029481790506
def generate_interaction_plot_4d(M, i, predictor_columns=[], response_column="Richness"):
XX = M.generate_X_grid(term=i, meshgrid=True)
feature_index = M.terms[i].feature
pdep, confi = M.partial_dependence(term=i, X=XX, width=0.95, meshgrid=True)
# print(confi.shape)
# print(confi)
if len(predictor_columns) == 0:
predictor_columns = X.columns
# Create a 4D plot
fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')
# Plot the surface
surf = ax.scatter(XX[1].reshape(pdep.shape),
XX[2].reshape(pdep.shape),
pdep, c=XX[0].reshape(pdep.shape), cmap='viridis', edgecolor='none')
# Set labels and title
ax.set_xlabel(predictor_columns[feature_index[1]])
ax.set_ylabel(predictor_columns[feature_index[2]])
ax.set_zlabel(f'Predicted {response_column}')
    ax.set_title(f'Partial Dependence Plot for {predictor_columns[feature_index[0]]}, {predictor_columns[feature_index[1]]} and {predictor_columns[feature_index[2]]}')
ax.set_box_aspect(None, zoom=0.9)
# Add color bar
cbar = plt.colorbar(surf)
cbar.set_label(predictor_columns[feature_index[0]])
# Show the plot
plt.show()
generate_interaction_plot_4d(M2, 0, X.columns)
genrate_partial_dependence_graph(M2, 0, X.columns, "Richness")
M2.summary()
PoissonGAM
=============================================== ==========================================================
Distribution: PoissonDist Effective DoF: 275.2661
Link Function: LogLink Log Likelihood: -7087.5554
Number of Samples: 2259 AIC: 14725.6429
AICc: 14802.9481
UBRE: 2.9436
Scale: 1.0
Pseudo R-Squared: 0.6643
==========================================================================================================
Feature Function Lambda Rank EDoF P > x Sig. Code
================================= ==================== ============ ============ ============ ============
te(0, 1, 2) [0.001 0.001 0.001] 729 275.3 0.00e+00 ***
intercept 1 0.0 8.06e-02 .
==========================================================================================================
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
which can cause p-values to appear significant when they are not.
WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
known smoothing parameters, but when smoothing parameters have been estimated, the p-values
are typically lower than they should be, meaning that the tests reject the null too readily.
/var/folders/fq/t1vhfgh95v33mglpnsqgnyyh0000gq/T/ipykernel_9609/4283885535.py:1: UserWarning: KNOWN BUG: p-values computed in this summary are likely much smaller than they should be. Please do not make inferences based on these values! Collaborate on a solution, and stay up to date at: github.com/dswah/pyGAM/issues/163 M2.summary()
d) Compare M1 and M2. Note that you can use the $R^2$, AIC values, test error, among other approaches discussed in class. Which model should we select?
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
y_pred_m1 = M1.predict(X_test)
y_pred_m2 = M2.predict(X_test)
r2_m1 = r2_score(y_test, y_pred_m1)
r2_m2 = r2_score(y_test, y_pred_m2)
print("R-squared for M1:", r2_m1)
print("R-squared for M2:", r2_m2)
print("AIC for M1:", M1.statistics_['AIC'])
print("AIC for M2:", M2.statistics_['AIC'])
mse_m1 = mean_squared_error(y_test, y_pred_m1)
mse_m2 = mean_squared_error(y_test, y_pred_m2)
print("MSE for M1:", mse_m1)
print("MSE for M2:", mse_m2)
print("UBRE for M1:", M1.statistics_['UBRE'])
print("UBRE for M2:", M2.statistics_['UBRE'])
print("Log Likelihood for M1:", M1.statistics_['loglikelihood'])
print("Log Likelihood for M2:", M2.statistics_['loglikelihood'])
print("deviance for M1:", M1.statistics_['deviance'])
print("deviance for M2:", M2.statistics_['deviance'])
R-squared for M1: 0.6033590739516503
R-squared for M2: 0.5668373036806489
AIC for M1: 14538.101402113918
AIC for M2: 14725.642880036276
MSE for M1: 31.934568944096114
MSE for M2: 34.875029481790506
UBRE for M1: 2.807954648511525
UBRE for M2: 2.943558678615654
Log Likelihood for M1: -7142.269649208129
Log Likelihood for M2: -7087.5553555657225
deviance for M1: 1470.1826058108095
deviance for M2: 1360.754018525999
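For reference, the deviance values printed above are the Poisson deviance, $D = 2\sum_i [\, y_i \log(y_i/\mu_i) - (y_i - \mu_i) \,]$, with the convention that $y \log(y/\mu) = 0$ when $y = 0$. A minimal NumPy sketch (`y_true` being the observed counts and `mu` the model's predicted means are assumed names, not pyGAM API):

```python
import numpy as np

def poisson_deviance(y_true, mu):
    """Poisson deviance: 2 * sum(y*log(y/mu) - (y - mu)), with 0*log(0) := 0."""
    y_true = np.asarray(y_true, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Where y == 0, the y*log(y/mu) term is 0 by convention.
    with np.errstate(divide='ignore', invalid='ignore'):
        term = np.where(y_true > 0, y_true * np.log(y_true / mu), 0.0)
    return 2.0 * np.sum(term - (y_true - mu))
```

For example, `poisson_deviance([2], [1.0])` is $2(2\ln 2 - 1) \approx 0.7726$, and a perfect fit gives zero deviance. Lower deviance means a better in-sample fit, which is why M2's deviance is below M1's despite its worse test error.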
As per the scores,
- R-squared (test): M1 is higher, hence better.
- AIC: M1 is lower, hence better.
- Test MSE: M1 is lower, hence better.
- UBRE: M1 is lower, hence better.
- Log Likelihood: M2 is higher (-7087.6 vs -7142.3), so M2 fits the training data slightly better.
- Deviance: M2 is lower (1360.8 vs 1470.2), again indicating a slightly better in-sample fit.
M2's advantage is confined to in-sample fit and comes at the cost of far more effective degrees of freedom (275.3 vs 126.8). On every criterion that penalises complexity or measures generalization (test R-squared, test MSE, AIC, UBRE), M1 wins. Overall M1 is the better model and we should select M1.
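As a sanity check on the reported AIC values, AIC can be recomputed by hand from the log-likelihood and the effective degrees of freedom via AIC = -2·logL + 2·EDoF. A minimal sketch using the values copied from the two model summaries above:

```python
# Recompute AIC from the summary values: AIC = -2 * log-likelihood + 2 * EDoF.
def aic(log_likelihood, edof):
    return -2 * log_likelihood + 2 * edof

# Values taken from the M1 and M2 summaries printed earlier.
aic_m1 = aic(-7142.2696, 126.7811)  # ~14538.10, matching M1's reported AIC
aic_m2 = aic(-7087.5554, 275.2661)  # ~14725.64, matching M2's reported AIC
print(f"AIC M1: {aic_m1:.4f}")
print(f"AIC M2: {aic_m2:.4f}")
```

This makes explicit why M2's higher log-likelihood does not produce a lower AIC: the extra effective degrees of freedom (275.3 vs 126.8) dominate the penalty term.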